<p float="left">
  <img src="PlantHub-full-rgb.png" style="height:100px" alt="PlantHub logo" class="light-logo">
  <img src="PlantHub-full-white.png" style="height:100px" alt="PlantHub logo" class="dark-logo">
  <img src="gfoe.png" style="height:100px" alt="gfö logo" class="light-logo">
  <img src="gfoe_inv.png" style="height:100px" alt="gfö logo" class="dark-logo">
  <img src="NFDI4Biodiversity.svg" style="height:100px" alt="NFDI logo" class="light-logo">
   <img src="NFDI4Biodiversity_text_inv.svg" style="height:100px" alt="NFDI 2023 logo" class="dark-logo">
</p>

# Vernacular name matching

This workbook shows how to find scientific names for a given list of vernacular names using the [GBIF](https://www.gbif.org/) database. While it was tedious to get lists of scientific names with attached vernacular names in the past, the GBIF API offers a fast and convenient way to get corresponding scientific names for vernacular names for any kind of living beings. In the hands on-part of this notebooks, you will try to figure a way to select the best result of the bunch of results you get back from the API. You will also seek to speed up the scientific name matching process by running it in parallel.

## Prerequisites

To run the code presented here, you will need 
- the sample names list provided in the workshop,
- a functioning R environment and 
- the R packages `data.table`, `rgbif`, `RJSONIO`, and `doSNOW` installed.

## Code

The first block of code loads libraries and prepares the workspace. You will need to adapt the working directory.

In [1]:
# load packages
library(data.table) # handle large datasets
library(rgbif) # access GBIF data
library(doSNOW) # parallel computing
library(RJSONIO) # parse JSON

# clear workspace
rm(list = ls())

# set working directory
setwd(paste0(.brd, "gfoe NFDI taxonomic harmonization workshop"))

# load data
vernNames <- fread("vernacular names_2024-04-09.txt", sep = "\t")


Lade nötiges Paket: foreach

Lade nötiges Paket: iterators

Lade nötiges Paket: snow



Let's look at the data.

In [2]:
str(vernNames)


Classes 'data.table' and 'data.frame':	1000 obs. of  1 variable:
 $ vernacularName: chr  "Bermuda Cress" "Bibernelle, Große" "Mandarine" "Lavendel, Französischer" ...
 - attr(*, ".internal.selfref")=<externalptr> 


As can be seen, the names found in the file are a mixture of English and German vernacular names. For simplicity, they are all plant names, but it would not make a difference if there were animals or fungi names included. The names used here were gathered from an [English wikipedia page](https://simple.wikipedia.org/wiki/List_of_plants_by_common_name) and a [German website](http://www.pflanzenliebe.de/innen/innen_liste_deutsch.html) of vernacular plant names. Both pages also include the scientific names of plants, which can serve as a check to the results obtained here.


### Encoding

Unfortunately, when getting data from differing sources, we will often find that these data have been [encoded](https://en.wikipedia.org/wiki/Binary-to-text_encoding) in different ways. This means that while the typical English language letters will be stored the same way on any machine, when it comes to accents and some other special characters, it may matter whether data was stored by a computer in the US or Japan, and whether the computer has a Windows, Mac, or Linux operating system. 

We will deal with the most common case: Data being stored in the Windows-specific [CP-1252 encoding](https://en.wikipedia.org/wiki/Windows-1252) (mislabeled ANSI or latin1 sometimes) and not in [UTF-8](https://en.wikipedia.org/wiki/UTF-8).

How your machine treats data from different encodings depends on what encoding is preset in your console. You can check this using the following:

In [3]:
Sys.getlocale()


If your console has no UTF-8 setting (no matter the language) you may change it like this:

`Sys.setlocale(category = "LC_ALL", locale = "German_Germany.utf8")`

You can use another encoding, too, but it may throw errors later on. So let's check whether the data comes in UTF-8, and if not, let's repair it, assuming it is CP-1252 (our best guess, likely correct in 99% of the cases).

In [4]:
# check whether correct encoding is UTF-8
table(validUTF8(vernNames$vernacularName))



TRUE 
1000 

That looks all good, so there is nothing to do here. Otherwise we would apply the following:

`plants[!validUTF8(vernacularName), newName := iconv(vernacularName, from = "CP1252", to = "UTF-8")]`

converting all non-UTF-8 characters to UTF-8.

### TRY vernacular name matching with the `rgbif` package
As there is the `rgbif` package available to query GBIF, one would assume that a function from therein can be used to retrieve scientific names for the vernacular names.

In [5]:
name_lookup("Bermuda Cress", limit = 5)
name_lookup("Gänseblümchen", limit = 5)
name_lookup("Asiatischer Elefant", limit = 5)
name_lookup("cotton", limit = 5)


Records found [1] 
Records returned [1] 
No. unique hierarchies [1] 
No. facets [0] 
No. names [1] 
Args [q=Bermuda Cress, limit=5, offset=0] 
[90m# A tibble: 1 × 21[39m
        key scientificName    datasetKey parentKey parent genus species genusKey
      [3m[90m<int>[39m[23m [3m[90m<chr>[39m[23m             [3m[90m<chr>[39m[23m          [3m[90m<int>[39m[23m [3m[90m<chr>[39m[23m  [3m[90m<chr>[39m[23m [3m[90m<chr>[39m[23m      [3m[90m<int>[39m[23m
[90m1[39m 165[4m6[24m[4m2[24m[4m6[24m894 Barbarea verna (… cbb6498e-… 165[4m6[24m[4m2[24m[4m6[24m881 Barba… Barb… Barbar…   1.66[90me[39m8
[90m# ℹ 13 more variables: speciesKey <int>, canonicalName <chr>, authorship <chr>,[39m
[90m#   nameType <chr>, taxonomicStatus <chr>, rank <chr>, origin <chr>,[39m
[90m#   numDescendants <int>, numOccurrences <int>, habitats <lgl>,[39m
[90m#   nomenclaturalStatus <lgl>, threatStatuses <lgl>, synonym <lgl>[39m

Records found [50] 
Records returned [5] 
No. unique hierarchies [5] 
No. facets [0] 
No. names [5] 
Args [q=Gänseblümchen, limit=5, offset=0] 
[90m# A tibble: 5 × 34[39m
        key scientificName datasetKey nubKey parentKey parent order family genus
      [3m[90m<int>[39m[23m [3m[90m<chr>[39m[23m          [3m[90m<chr>[39m[23m       [3m[90m<int>[39m[23m     [3m[90m<int>[39m[23m [3m[90m<chr>[39m[23m  [3m[90m<chr>[39m[23m [3m[90m<chr>[39m[23m  [3m[90m<chr>[39m[23m
[90m1[39m 100[4m3[24m[4m3[24m[4m6[24m410 Bellis L.      16c3f9cb-… 3.12[90me[39m6 223[4m9[24m[4m8[24m[4m0[24m151 Aster… Aste… Aster… Bell…
[90m2[39m 116[4m7[24m[4m8[24m[4m1[24m451 Bellis L.      d027759f-… 3.12[90me[39m6 116[4m7[24m[4m8[24m[4m0[24m684 Aster… Aste… Aster… Bell…
[90m3[39m   3[4m1[24m[4m1[24m[4m7[24m399 Bellis L.      d7dddbf4-… 3.12[90me[39m6      [4m3[24m065 Aster… Aste… Aster… Bell…
[90m4[39m 100[4m3[24m[4m4[24m[4m1[24m50

Records found [14] 
Records returned [5] 
No. unique hierarchies [5] 
No. facets [0] 
No. names [4] 
Args [q=Asiatischer Elefant, limit=5, offset=0] 
[90m# A tibble: 5 × 36[39m
       key scientificName datasetKey  nubKey parentKey parent order family genus
     [3m[90m<int>[39m[23m [3m[90m<chr>[39m[23m          [3m[90m<chr>[39m[23m        [3m[90m<int>[39m[23m     [3m[90m<int>[39m[23m [3m[90m<chr>[39m[23m  [3m[90m<chr>[39m[23m [3m[90m<chr>[39m[23m  [3m[90m<chr>[39m[23m
[90m1[39m   1.00[90me[39m8 Elephas maxim… 16c3f9cb-… 5[4m2[24m[4m1[24m[4m9[24m461 223[4m9[24m[4m7[24m[4m8[24m954 Eleph… Prob… Eleph… Elep…
[90m2[39m   1.96[90me[39m8 Elephas maxim… 23a3fa4c-… 5[4m2[24m[4m1[24m[4m9[24m461 225[4m2[24m[4m0[24m[4m9[24m602 Eleph… Prob… Eleph… Elep…
[90m3[39m   5.22[90me[39m6 Elephas maxim… d7dddbf4-… 5[4m2[24m[4m1[24m[4m9[24m461   2[4m4[24m[4m3[24m[4m5[24m351 Eleph… Prob… Eleph… Elep…
[90m4[39m   1.65[90m

Records found [6458] 
Records returned [5] 
No. unique hierarchies [5] 
No. facets [0] 
No. names [0] 
Args [q=cotton, limit=5, offset=0] 
[90m# A tibble: 5 × 34[39m
      key scientificName nameKey datasetKey nubKey parentKey parent phylum order
    [3m[90m<int>[39m[23m [3m[90m<chr>[39m[23m            [3m[90m<int>[39m[23m [3m[90m<chr>[39m[23m       [3m[90m<int>[39m[23m     [3m[90m<int>[39m[23m [3m[90m<chr>[39m[23m  [3m[90m<chr>[39m[23m  [3m[90m<chr>[39m[23m
[90m1[39m  1.21[90me[39m8 Neodiastoma C… 7[4m4[24m[4m3[24m[4m6[24m314 c33ce2f2-… 4.61[90me[39m6 121[4m2[24m[4m8[24m[4m5[24m847 Pareo… Mollu… Sorb…
[90m2[39m  4.61[90me[39m6 Callitriphora… 1[4m8[24m[4m3[24m[4m8[24m041 d7dddbf4-… 4.61[90me[39m6      [4m2[24m660 Triph… Mollu… [31mNA[39m   
[90m3[39m  1.22[90me[39m8 Belgradeophyl… 1[4m4[24m[4m1[24m[4m1[24m263 c33ce2f2-… 4.88[90me[39m6 121[4m5[24m[4m1[24m[4m7[24m396 Rugosa Cnida… [31mNA[39m   
[

### Creating a custom vernacular name matching function using the GBIF API

From the examples shown here, we see that data is returned, and for some names, as "Bermuda Cress" or the Asian Elephant ("Asiatischer Elefant"), there are a limited number of records available, and we can easily select the (first) correct one. In the last example, "cotton", however, we get over 6000 matches, and it is impossible to select the correct one easily. The problem here is that `name_lookup` does not return the actual matched vernacular names. It searches in all fields, and the first results returned have "Cotton" as the author of the scientific name, not as a vernacular name. The biggest problem is that the results do not include the actual vernacular names, even if they were used in matching, which impedes a subsequent filtering. As this is not what we want, we will need to access the GBIF API directly instead of relying on the `rgbif` package.

Fortunately, that is not difficult, as soon as we know the syntax. In our call, we will make sure that the name query is only done on vernacular names.

In [6]:
# define the search term
searchName <- "cotton"
# define the maximum number of results per query
nRes <- 100

# directly call the GBIF API
res <- fromJSON(paste0(
	"https://api.gbif.org/v1/species/search?q=",
	searchName, "&offset=400&qField=VERNACULAR&limit=", nRes
))

# check result
names(res)
res[c(1:4, 6)]
length(res$results)


$offset
[1] 400

$limit
[1] 100

$endOfRecords
[1] TRUE

$count
[1] 477

$facets
list()


As we can see, the API call returns a list with six elements. 
- The first element tells us the the <b>index of the first retrieved element minus one</b>. So if the offset is 10, we will get the results from the 11th match onwards. We can set the offset in the API call and this will allow us to retrieve all results even if there are more than 1000 later on. 
- The second element tells us what the <b>maximum number of returned results</b> is. Note that we have defined that number ourselves in nRes. GBIF will ignore any number > 1000 and set it to 1000. 
- The third element tells us <b>whether we have included the last match</b> in our returned results. In case our limit is smaller than the number of results, this will only be the case if we use an offset to include the last match.
- The fourth element tells us <b>how many matches</b> to our query were found, it corresponds to "Records found" in the `name_lookup` function. 
- The fifth element contains the actual <b>results</b>. Its length corresponds to the "Records returned" in the `name_lookup` function. 

We will ignore the last element that is not relevant for us here. Let us now look at the results.

In [7]:
length(res$results)
res$results[[1]]


$key
[1] 176665880

$datasetKey
[1] "19491596-35ae-4a91-9a98-85cf505f1bd3"

$nubKey
[1] 2436469

$parentKey
[1] 224008563

$parent
[1] "Saguinus"

$kingdom
[1] "ANIMALIA"

$phylum
[1] "CHORDATA"

$order
[1] "PRIMATES"

$family
[1] "CALLITRICHIDAE"

$genus
[1] "Saguinus"

$species
[1] "Saguinus oedipus"

$kingdomKey
[1] 223993981

$phylumKey
[1] 223994122

$classKey
[1] 224006554

$orderKey
[1] 224008432

$familyKey
[1] 224008550

$genusKey
[1] 224008563

$speciesKey
[1] 176665880

$scientificName
[1] "Saguinus oedipus (Linnaeus, 1758)"

$canonicalName
[1] "Saguinus oedipus"

$authorship
[1] "(Linnaeus, 1758)"

$nameType
[1] "SCIENTIFIC"

$taxonomicStatus
[1] "ACCEPTED"

$rank
[1] "SPECIES"

$origin
[1] "SOURCE"

$numDescendants
[1] 0

$numOccurrences
[1] 0

$habitats
list()

$nomenclaturalStatus
list()

$threatStatuses
[1] "CRITICALLY_ENDANGERED"

$descriptions
list()

$vernacularNames
$vernacularNames[[1]]
         vernacularName                language 
"Cotton-headed Tamarin"       

From the first result we see here we notice that the each individual result is a list with single elements, except for the elements "vernacularNames" and "higherClassificationMap". We will focus on the vernacular names. To extract the information needed, we have to consider that each result can store a variable number of vernacular names. To transfer everything into a dataframe structure, we will have to create as many rows as there are vernacular names for each result, repeating the information from the other fields. For simplicity, we will not process the information in the "higherClassificationMap" variable. We also have to account for the maximum number of results to be retrieved so that we get all information for a certain name. The code below does this job. It conveniently stores the data in a data.table object

In [8]:
# define variables to extract (you may modify this depending on your needs)
resVars <- c(
	"canonicalName", "authorship", "scientificName", "genus", "family", "order", "class", "phylum", "kingdom",
	"key", "nubKey", "nameType", " taxonomicStatus", "rank", "origin"
)

# call the GBIF API
res <- fromJSON(paste0("https://api.gbif.org/v1/species/search?q=", searchName, "&qField=VERNACULAR&limit=", nRes))

resTable <- data.table(vernacularName = character())
for (i in seq_along(resVars)) {
	resTable[, new := character()]
	colnames(resTable)[ncol(resTable)] <- resVars[i]
}

# calculate number of queries given nRes
nRuns <- res$count %/% nRes + ifelse(res$count %% nRes > 0, 1, 0)
for (i in seq_len(nRuns)) {
	# query data
	if (i > 1) {
		res <- fromJSON(paste0(
			"https://api.gbif.org/v1/species/search?q=", searchName,
			"&qField=VERNACULAR&limit=", nRes, "&offset=", (i - 1) * 20
		))
	}
	# structure data
	res <- res$results
	for (j in seq_along(res)) {
		# extract the vernacular name (first element of each vernacularNames element) from the data
		# you could also extract the second element, which is the language abbreviation
		temp <- data.table(vernacularName = sapply(res[[j]]$vernacularNames, function(x) x[1]))
		# fill the remaining fields
		for (k in seq_along(resVars)) {
			if (resVars[k] %in% names(res[[j]])) {
				temp[, new := res[[j]][[which(resVars[k] == names(res[[j]]))]]]
			} else {
				temp[, new := character()]
			}
			colnames(temp)[ncol(temp)] <- resVars[k]
		}
		resTable <- rbind(resTable, temp)
	}
}

resTable


vernacularName,canonicalName,authorship,scientificName,genus,family,order,class,phylum,kingdom,key,nubKey,nameType,taxonomicStatus,rank,origin
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
cottonthistle,Onopordum,L.,Onopordum L.,Onopordum,Asteraceae,Asterales,Magnoliopsida,Tracheophyta,Plantae,102236399,3094883,SCIENTIFIC,,GENUS,SOURCE
Cotton Thistles,Onopordum,L.,Onopordum L.,Onopordum,Asteraceae,Asterales,Magnoliopsida,Tracheophyta,Plantae,160783497,3094883,SCIENTIFIC,,GENUS,SOURCE
Cottonthistle,Onopordum,L.,Onopordum L.,Onopordum,Asteraceae,Asterales,Magnoliopsida,Tracheophyta,Plantae,160783497,3094883,SCIENTIFIC,,GENUS,SOURCE
Bog Cotton,Eriophorum,L.,Eriophorum L.,Eriophorum,Cyperaceae,Poales,Liliopsida,Tracheophyta,Plantae,160786775,2730118,SCIENTIFIC,,GENUS,SOURCE
Bog-Cotton,Eriophorum,L.,Eriophorum L.,Eriophorum,Cyperaceae,Poales,Liliopsida,Tracheophyta,Plantae,160786775,2730118,SCIENTIFIC,,GENUS,SOURCE
Cottongrass,Eriophorum,L.,Eriophorum L.,Eriophorum,Cyperaceae,Poales,Liliopsida,Tracheophyta,Plantae,160786775,2730118,SCIENTIFIC,,GENUS,SOURCE
Bog Cotton,Eriophorum,L.,Eriophorum L.,Eriophorum,Cyperaceae,Poales,Liliopsida,Tracheophyta,Plantae,206228250,2730118,SCIENTIFIC,,GENUS,SOURCE
Bog-Cotton,Eriophorum,L.,Eriophorum L.,Eriophorum,Cyperaceae,Poales,Liliopsida,Tracheophyta,Plantae,206228250,2730118,SCIENTIFIC,,GENUS,SOURCE
Cottongrass,Eriophorum,L.,Eriophorum L.,Eriophorum,Cyperaceae,Poales,Liliopsida,Tracheophyta,Plantae,206228250,2730118,SCIENTIFIC,,GENUS,SOURCE
bog cotton,Eriophorum,Linnaeus,Eriophorum Linnaeus,Eriophorum,Cyperaceae,Poales,Equisetopsida,,,100014655,2730118,SCIENTIFIC,,GENUS,SOURCE


We can see that our results table is quite large. Fortunately, we have some information we can use to extract the data that we want. Most importantly, the "vernacularName" column stores the actual vernacular names. As we see in the example, the word "cotton" appears in a lot of names, including in animals or fungi. As a first step we can reduce the results to plants only.

In [9]:
nrow(resTable)
table(resTable$kingdom)
resTable <- resTable[kingdom %in% c("Plantae", "PLANTAE")]
nrow(resTable)



     Animalia      ANIMALIA       Metazoa       Plantae       PLANTAE 
          392            39            15          1697            32 
Viridiplantae 
           31 

This has reduced the number of results from about 2200 to 1700. 

>TASKS:
>1) Try to find other ways to reduce the number of results. Ideally, you should keep one result only per name. 
>2) It would also be a good idea to pack the code in a function, so that it can easily be applied in a loop. 
>3) To further increase the quality of the matching, you might want to consider to check the vernacular names and apply some kind of pre-processing to them.

### Parallel processing

As the matching process will take quite some time for each name, it makes sense to parallelize it. An example on how to parallelize a the execution of a function can be found below. Let's first check how many cores are available on our system.

In [10]:
parallel::detectCores()


It is unlikely that you have so many cores available, but from former trials with the GBIF API I can tell you that is is wise to limit the core number to 24 at maximum. In this exercise, I will reduce the number of cores used by 1 to avoid blocking my computer for other tasks while the loop is running.

In [11]:
# a test function
testFunction <- function(x) {
	if (x < 0) {
		return(1000 %% (-x))
	} else if (x > 0) {
		return(1000 %% x)
	} else {
		return(0)
	}
}

# create test data
testData <- seq(-1000, 1000)

# create results vectors
resSeq <- rep(NA, length(testData))
resPar <- rep(NA, length(testData))

# sequential loop
startTime <- Sys.time()
for (i in seq_along(testData)) {
	resSeq[i] <- testFunction(testData[i])
}
Sys.time() - startTime

# create the cluster for parallel processing
cl <- makeCluster(parallel::detectCores() - 1)
registerDoSNOW(cl)

# parallel loop
startTime <- Sys.time()
resPar <- foreach(i = seq_along(testData), .combine = c) %dopar% {
	# note that the result of each loop execution will be returned and stored in resPar eventually
	# however, if anything happens in the loop, it will be lost
	testFunction(testData[i])
}
Sys.time() - startTime

# stop the cluster
stopCluster(cl)

# test whether results are identical
all(resSeq == resPar)


Time difference of 0.01140094 secs

Time difference of 1.056732 secs

As we can see, in our little example, the use of parallel processing was not necessary. It needs so much time to set up the parallel processing that there is no gain from it. Whenever the task becomes more complicated and takes more time, this will, however, pay back.

>TASKS:
>
>4) Implement the parallel processing in the vernacular name matching algorihm.